48 


Reliability, Redundancy, and Voting Systems 


WILLIAM GOBLE 


INTRODUCTION 

There are many variables that must be optimized in any
engineering application. Reliability should be one of those
variables. Depending on the application, failures in an
automation system can result in human injury or loss of life. In
some applications, automation system failure can result in
severe negative environmental as well as significant
economic impacts. Given this importance, a good understanding
of the attributes of reliability is needed.

There are several different metrics used in the fields of 
reliability and safety engineering. These include failure rate, 
mean time to failure (MTTF), Reliability, Availability, prob- 
ability of fail-safe (PFS), probability of fail-danger (PFD), 
and PFDavg. An understanding of these metrics is useful in 
the optimization of processes. 

FAILURES 

A failure is defined in many reliability engineering textbooks 
[1,2] as “When a device does not perform its intended func- 
tion when operated within its intended operating environ- 
ment.” A strong quality-oriented definition adds a reference 
to the need of a device designer to completely understand the 
intended environment of the device. That definition of failure 
which is preferred [3] is “When a device does not perform 
its intended function when operated within any reasonably 
foreseeable environment.” 

A failure occurs when a stress exceeds the associated
strength. The stress is a combination of all the things that
cause failure: heat, humidity, shock, vibration, electrical
voltage and current, corrosion, equipment operator errors, and
maintenance personnel errors. Two important issues always exist:
the stress and the associated strength. This is not only a
useful concept; the theory is also being developed to predict
failure rates and the useful life of a device more accurately [4].

The term failure rate is defined as the number of failures 
of a device per unit operating hours. Failure rate has units of 
inverse time. It is a common practice to use units of “failures/ 
billion hours.” This unit is known as “FIT” for Failure unIT. 
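As a small illustration of the FIT unit, the sketch below converts a failure rate quoted in FIT into failures per hour and per year. The function names, the 500 FIT device, and the 8760 hours per year figure are assumptions chosen for the example.

```python
HOURS_PER_YEAR = 8760.0  # assumption: 365-day year of continuous operation


def fit_to_per_hour(fit: float) -> float:
    """Convert a failure rate in FIT (failures per 1e9 hours) to failures per hour."""
    return fit * 1e-9


def fit_to_per_year(fit: float) -> float:
    """Convert a failure rate in FIT to failures per year of continuous operation."""
    return fit_to_per_hour(fit) * HOURS_PER_YEAR


if __name__ == "__main__":
    rate_fit = 500.0  # hypothetical device failure rate in FIT
    print(f"{rate_fit} FIT = {fit_to_per_hour(rate_fit):.2e} failures/hour")
    print(f"{rate_fit} FIT = {fit_to_per_year(rate_fit):.2e} failures/year")
```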

Many studies of actual failure rates as a function of oper- 
ating time interval show a plot as shown in Figure 48.1. This 
is known as “the bathtub curve.” 


The initial period of the curve where the failure rate is
declining is called the “infant mortality” region. This
declining failure rate is due to weak components in a population
that fail early, even at lower (almost normal) stress levels. As
those weak units fail and are removed from the population, only
stronger devices remain, so the failure rate declines until the
weak units are gone. After that relatively short time period,
often only a few weeks at most, the failure rate remains roughly
steady or even declines slowly. This period is known as the
“useful life.” The useful life region typically lasts many
years. At some point in time, the failure rate increases and the
devices have entered the “wearout” region. This is due to
degradation mechanisms which slowly decrease the strength of
each device. When the strength approaches normal stress levels,
the failure rate goes up rapidly.

Reliability models are typically created using the assump- 
tion of a constant failure rate. This significantly reduces the 
complexity of the mathematics and makes much of the analy- 
sis work practical. This is a reasonable and even conserva- 
tive assumption if the infant mortality phase and the wearout 
phase of the bathtub curve are excluded. Infant mortality is 
often excluded as manufacturers have developed techniques 
to accelerate this phase even before devices are shipped to 
consumers. Such programs are called “run-in” or “burn- 
in” and have been shown to be quite effective. The wearout 
phase is excluded by specifying a useful life limit as well as 
a constant failure rate during useful life. Thus, two numbers 
(Figure 48.2) bound the bathtub curve and allow practical 
reliability and safety modeling. 

RELIABILITY 

Reliability is a metric indicating probability of success- 
ful operation over an interval of time, the “mission time.” 
Reliability is defined as “the probability that a device will 
perform its intended function over an interval of time when 
required to do so when operated within any reasonably fore- 
seeable environment.” Mathematically, reliability ( R ) has a 
precise definition: “The probability that a device will be suc- 
cessful in the time interval from zero to t.” In terms of the 
random variable T, 

$R(t) = P(T > t)$ (48.1)



© 2012 by Bela Liptak 





FIG. 48.1

Failure rate as a function of operating time interval: the “bathtub curve.”



FIG. 48.2 

Failure rate characterized by useful life limit and constant failure 
rate. 


Like any probability, reliability is a number from zero to
one. The reliability function of a device that has been tested
and is fully functional equals one at time zero. The reliability
function decreases as the time interval increases, as illustrated
in Figure 48.3. This tells us that eventually, if operated for a
long enough period of time, everything will fail.

FIG. 48.3

Reliability function example.

Reliability is a function of device failure rates and operating
time interval. Since reliability is a function of operating
time interval, it has meaning only when the time interval is
specified. A statement like “the reliability equals 0.87” has
no meaning. A meaningful statement would be “the reliability
for a 1 year mission time equals 0.87.”

Unreliability (F) is a metric indicating probability of
unsuccessful operation over an interval of time. Unreliability
is defined as “the probability that a device will fail to
perform its intended function sometime in the time interval from
zero to t.” Since any component must be either successful or
failed, F(t) is the one’s complement of R(t):

$F(t) = 1 - R(t)$ (48.2)


MTTF

The “time to failure” is the primary random variable in
the field of reliability engineering. One of the first metrics
defined was the average of this random variable, MTTF. This
metric has variations, because often some but not all of the
failures are included. The most common usage is to include
failures during the useful life only.

Constant Failure Rate

A common probability density function in the field of
reliability engineering is the exponential. It was developed
early in the history of the field from the constant failure rate
assumption:

If $\lambda(t) = \lambda$,

it can be shown that

$R(t) = e^{-\lambda t}$ (48.3)

and therefore

$F(t) = 1 - e^{-\lambda t}$ (48.4)

The constant failure rate is very characteristic of many 
kinds of products during the useful life period. In the stress- 
strength simulations, it can be seen that this is characteristic 
of a rather uniform constant strength and a random stress. 
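The exponential reliability function can be evaluated directly. The sketch below reuses the “reliability for a 1 year mission time equals 0.87” statement from the text, back-calculates the constant failure rate that would produce it, and then evaluates the reliability for a longer mission. The function names and the 8760 hour year are assumptions made for the example.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate (Equation 48.3)."""
    return math.exp(-failure_rate * t)


def unreliability(failure_rate: float, t: float) -> float:
    """F(t) = 1 - R(t) (Equations 48.2 and 48.4)."""
    return 1.0 - reliability(failure_rate, t)


def failure_rate_for(target_reliability: float, mission_time: float) -> float:
    """Constant failure rate that gives the target reliability at mission_time."""
    return -math.log(target_reliability) / mission_time


if __name__ == "__main__":
    one_year = 8760.0  # hours, assuming continuous operation
    lam = failure_rate_for(0.87, one_year)
    print(f"lambda    = {lam:.3e} failures/hour")
    print(f"R(1 year) = {reliability(lam, one_year):.3f}")
    print(f"R(5 year) = {reliability(lam, 5 * one_year):.3f}")
    print(f"F(5 year) = {unreliability(lam, 5 * one_year):.3f}")
```

The same failure rate gives a very different reliability for a different mission time, which is why a reliability figure is meaningful only together with its time interval.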

The MTTF of a component with an exponential prob- 
ability density function can be derived from the definition 
of MTTF: 


$MTTF = \int_0^\infty R(t)\,dt$ (48.5)

Substituting the exponential reliability function,

$MTTF = \int_0^\infty e^{-\lambda t}\,dt$ (48.6)

and integrating,

$MTTF = -\frac{1}{\lambda}\left[e^{-\lambda t}\right]_0^\infty$ (48.7)

When the exponential is evaluated, the value as t
approaches infinity is zero and the value at t = 0 is one.
Substituting these results, we have a solution:

$MTTF = \frac{1}{\lambda}$ (48.8)

This is a common equation given for MTTF. However, it 
can be seen that this equation applies only during the useful 
life of a device and only for single devices or series systems. 
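The result MTTF = 1/λ can be checked numerically. The sketch below integrates the exponential reliability function with the trapezoidal rule and compares it with the closed form. The truncation of the infinite upper limit at 40/λ and the step count are assumptions of this example.

```python
import math


def mttf_exponential(failure_rate: float) -> float:
    """Closed-form MTTF for a constant failure rate (Equation 48.8)."""
    return 1.0 / failure_rate


def mttf_numeric(failure_rate: float, steps: int = 200_000) -> float:
    """Approximate the integral of R(t) = exp(-lambda * t) from 0 to infinity.

    The upper limit is truncated at 40 / lambda, where R(t) is negligible,
    and the integral is evaluated with the trapezoidal rule.
    """
    upper = 40.0 / failure_rate
    dt = upper / steps
    total = 0.5 * (1.0 + math.exp(-failure_rate * upper))  # end points
    for i in range(1, steps):
        total += math.exp(-failure_rate * i * dt)
    return total * dt


if __name__ == "__main__":
    lam = 2.0e-5  # hypothetical failure rate, failures/hour
    print(f"closed form: {mttf_exponential(lam):.1f} h")
    print(f"numeric:     {mttf_numeric(lam):.1f} h")
    # Reliability at t = MTTF is exp(-1), so most devices fail before the MTTF.
    print(f"R(MTTF) = {math.exp(-lam * mttf_exponential(lam)):.3f}")
```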

Availability 

Another metric was created specifically for repairable sys- 
tems. Availability is defined as the probability of successful 
operation at any moment in time. It is a metric that allows 
for failure and accounts for repair. Availability is a measure 
of “uptime” in a device and is represented by a function A(t). 
The availability metric is more suitable for automation sys- 
tems as these are typically ground based repairable systems. 

Availability is a function of device failure rates and repair 
rates. It is most often calculated using a Markov modeling 
technique. Some availability modeling techniques provide a 
time-dependent probability and some availability modeling 
techniques calculate an average or steady-state value for a 
given system. 

Unavailability, a measure of failure, is defined as “the 
probability that a device is not successful (is failed) at 
time t.” Unavailability is the one’s complement of availabil- 
ity; therefore, 

$U(t) = 1 - A(t)$ (48.9)
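For the simplest repairable case, the steady-state values have a closed form. The sketch below assumes a single device modeled as a two-state Markov process with a constant failure rate and a constant repair rate. This is a common textbook simplification and only one of the availability modeling techniques mentioned above, and the numeric values are hypothetical.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of one repairable device: mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


def steady_state_unavailability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state unavailability, the one's complement of availability."""
    return failure_rate / (failure_rate + repair_rate)


if __name__ == "__main__":
    lam = 1.0e-4        # hypothetical failure rate, failures/hour
    mu = 1.0 / 8.0      # hypothetical repair rate: 8 hour average repair time
    print(f"A = {steady_state_availability(lam, mu):.6f}")
    print(f"U = {steady_state_unavailability(lam, mu):.6f}")
```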


FAILURE MODES 

In some applications, failure metrics such as unreliability and 
unavailability are not sufficient as multiple failure modes are 
not distinguished by the metric. All failure modes contribute 
to unavailability. In situations where one or more particular 
failure modes are important, unavailability is partitioned into 
mutually exclusive failure modes. 

Any particular type of device can fail in more than one 
manner. The actual failure mode is often dependent on the 
function of the device. In actual applications, functional 
failure modes can be categorized and grouped. In automa- 
tion systems that provide an automatic protection function 
the failure modes are very important. Two particular groups 
of functional failure modes are classified as “fail-safe” and 
“fail-danger.” 


Fail-Safe 

IEC 61508-2000 [5], an international functional safety stan- 
dard, defines safe state as “state of the system when safety is 
achieved.” The objective of many automatic protection sys- 
tems is to sense a potentially dangerous condition in a process 
and take action to bring the system to a state “where safety is 
achieved.” In many critical applications, the designers choose 
a de-energized condition as the safe state. This translates to 
a requirement that a system designed for safety protection 
applications should de-energize its outputs to achieve a safe 
state. These systems are called normally energized. 

Consider an example microcomputer-based system that is 
normally energized (Figure 48.4). When the system is oper- 
ating successfully, the input switch is closed and energized, 
the input circuits translate the switch signal, the microcom- 
puter reads the input circuits, the microcomputer performs 
calculations, the microcomputer generates outputs, and the 
load is energized. Input switches are normally energized to 
indicate a safe condition. Output circuits supply energy to 
a load (typically, a valve or motor controller) which takes 
action only when de-energized. The sensor switch opens (de- 
energizes) to indicate a potentially dangerous condition. If 
the logic solver (typically a safety PLC) is programmed to 
recognize the sensor input as a potentially dangerous condi- 
tion it will de-energize its output(s). This action is designed 
to mitigate the potentially hazardous condition. 

When devices within the system fail in certain modes,
such as the output switch failing open so that the load is
de-energized, those represent fail-safe failures. Another
example of a fail-safe mode would be an open circuit of the wire
between the sense switch and controller.

FIG. 48.4

Fail-safe example.


Fail-Danger

Dangerous failures in an automatic protection system are
defined as those failures that prevent the system from
responding to a potentially dangerous condition. There are
many device failures within a system that might cause a
fail-danger failure. In the example system from Figure 48.4, a
short circuit failure of the sense switch is one example. PFD
is defined as the probability of failing in the fail-danger
mode. It is a subset of unavailability. In an automatic
protection system application where PFD varies with time and the
need for protection occurs infrequently, the average of the
PFD values is calculated. That is called PFDavg.
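One way to picture the averaging is sketched below. It assumes a single device whose dangerous failures are undetected, occur at a constant rate, and are fully revealed and repaired at a periodic proof test, so that PFD(t) follows the exponential unreliability within each test interval. These modeling assumptions, the function names, and the numeric values are illustrative and are not taken from the text.

```python
import math


def pfd(lambda_d: float, t: float) -> float:
    """Probability of dangerous failure by time t since the last proof test."""
    return -math.expm1(-lambda_d * t)  # equals 1 - exp(-lambda_d * t)


def pfd_avg(lambda_d: float, test_interval: float) -> float:
    """Time average of pfd(t) over one proof test interval."""
    x = lambda_d * test_interval
    return 1.0 - (-math.expm1(-x)) / x


def pfd_avg_approx(lambda_d: float, test_interval: float) -> float:
    """First-order approximation lambda_d * TI / 2, valid when lambda_d * TI is small."""
    return lambda_d * test_interval / 2.0


if __name__ == "__main__":
    lambda_d = 1.0e-6       # hypothetical dangerous failure rate, failures/hour
    test_interval = 8760.0  # hypothetical proof test interval: one year
    print(f"PFD at end of interval: {pfd(lambda_d, test_interval):.3e}")
    print(f"PFDavg (exact average): {pfd_avg(lambda_d, test_interval):.3e}")
    print(f"PFDavg (approximation): {pfd_avg_approx(lambda_d, test_interval):.3e}")
```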

Annunciation Failure 

Automatic diagnostic functions are commonly part of twenty- 
first century automation systems. The objective of any diag- 
nostic function is to detect abnormal operation of equipment, 
which is often the result of device failures. Some device fail- 
ures may fail the diagnostics. When these diagnostics stop 
working, the probability of safe failure or dangerous failure 
is increased. 

An annunciation failure is therefore defined as a failure 
that prevents automatic diagnostics from properly detecting 
or annunciating that a failure has occurred inside the equip- 
ment. Note that the failure may be within the equipment that 
fails or inside an external piece of equipment designed for the 
purpose of automatic diagnostics. 

RELIABILITY IMPROVEMENT— HIGH STRENGTH 

The probability of successful operation of a system can be 
improved by a number of methods. One of the simplest tech- 
niques is called high strength design. Given the concept of 
stress versus strength, the objective is to increase design 
strength to a level above even abnormal stress conditions. 
The first step is to make certain that the “reasonably foresee- 
able environment” has been defined. This should certainly 
include foreseeable random events including environmental 
stress, human errors, and random failures of support equip- 
ment. For example, a device that uses an air supply should 
be designed to tolerate occasionally dirty air that might be 
caused by clogging and bypass of an air filter. For existing 
devices that have been on the market for some period of 
time, manufacturers should look at all field failure returns 
especially those marked “not a failure” or “customer abuse.” 
These failure reports may have good insight into actual field 
stress conditions. 

For industrial control systems many stresses must be 
considered. A partial list is shown in Table 48.1 from Ref. [6] 
based on a number of different international standards. 


RELIABILITY IMPROVEMENT— REDUNDANCY 

Adding extra equipment in order for a system to perform
an intended function even with device failures has been a
common design practice for many decades. A common
example is a set of two power supplies, each capable of
supplying all needed electric current. If these two power supplies
(Figure 48.5) feed a load through diodes, most failures of
one power supply will not impact successful operation of the
other power supply.

The promise of redundancy to provide highly reliable
systems has been a strong motivator for system designers. If
the unavailability of a device is 0.01 and two devices can
be designed into a system such that only one is needed,
ideally the system unavailability is 0.0001 (0.01 × 0.01). A
Markov model of this ideal system is illustrated in Figure 48.6.
Failure rates are indicated with the lambda symbol (λ). Repair
rates are indicated with the mu symbol (μ).
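The ideal calculation in the paragraph above is a product of independent probabilities. A minimal sketch, assuming identical and fully independent devices where any one device is sufficient (the function name is hypothetical):

```python
def ideal_redundant_unavailability(device_unavailability: float, n_devices: int = 2) -> float:
    """System unavailability when any one of n independent, identical devices suffices.

    The system is down only when every device is down at the same time.
    """
    return device_unavailability ** n_devices


if __name__ == "__main__":
    u_device = 0.01  # device unavailability used in the text
    print(f"single device: {u_device}")
    print(f"ideal dual:    {ideal_redundant_unavailability(u_device, 2):.6f}")
    print(f"ideal triple:  {ideal_redundant_unavailability(u_device, 3):.8f}")
```

The independence assumption is what makes the result so attractive, and it is also the assumption that real systems rarely satisfy.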

Diagnostics 

There are a number of very unrealistic assumptions in that
simple analysis. The first unrealistic assumption is that
failures in a redundant system are recognized and repaired in
the same time period as in a non-redundant system. In a
non-redundant system, a failure of a device like a power supply
has a high likelihood of being immediately noticed as the
“system” has failed. This is not so in a redundant system. If
one power supply in the redundant power system of Figure
48.5 blows a fuse and does not deliver power, the system
continues to deliver current and has no failure. How does
the failure of one power supply ever become recognized
such that repairs are made? Unless automatic diagnostics
are designed into a system to detect such failures, periodic
manual testing must be done on each device. Often the
testing itself requires that the redundancy be disabled during
the test. This increases unavailability of the system. In the
case of device failures detected by periodic manual
testing, the unavailability numbers will be much higher. If the
periodic manual testing is not done at all, the unavailability
of the system must account for the unreliability of one
device over the expected lifetime of the system.
The actual decrease in unavailability in a redundant system
depends on the failure rates of the devices, and examples
have shown that some redundant designs provide little benefit
at all without automatic diagnostics or frequent manual
testing.

In some redundant designs, automatic diagnostics are 
used to control switching devices that route the device output 
from multiple devices to the system output. In the simplified 
design shown in Figure 48.7, two devices generate the needed 
system output and a diagnostic output indicating the “health” 
of the device. The switching device uses the two diagnostic 
signals to select which device output is sent to the system 
output. A logic table is shown in Table 48.2. 
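The switching logic of Table 48.2 can be written as a small function. In this sketch the outputs are modeled as booleans (True for energized), and the `prefer` argument, which decides the “A or B” case when both devices report healthy, is an assumption of the example rather than something specified in the table.

```python
def select_system_output(diag_a_ok: bool, diag_b_ok: bool,
                         output_a: bool, output_b: bool,
                         prefer: str = "A") -> bool:
    """Route a device output to the system output using the diagnostic signals.

    Returns True when the system output is energized. When both diagnostics
    report a failure, the system output is de-energized.
    """
    if diag_a_ok and diag_b_ok:
        return output_a if prefer == "A" else output_b
    if diag_a_ok:
        return output_a
    if diag_b_ok:
        return output_b
    return False  # both devices reported failed: de-energize


if __name__ == "__main__":
    # Device A reports a failure, so the output of device B is used.
    print(select_system_output(False, True, output_a=False, output_b=True))
    # Both devices report a failure, so the output is de-energized.
    print(select_system_output(False, False, output_a=True, output_b=True))
```

The function trusts the diagnostic signals completely, so a failed device that still reports healthy can be selected and its faulty output passed to the system output.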

What happens if the diagnostic signal does not detect
a failure in a device? It will indicate that the device is OK
when in fact it has failed. If this happens, the system will
fail if that device is selected by the switch. In such
redundant system designs, the effectiveness of the automatic
diagnostics will have a major impact on system unavailability. It
has been shown ([6], Chapter 9) that when diagnostic coverage
drops below 90%, system unavailability approaches that of a
non-redundant device. Given the added complexity and cost
of redundancy, all redundant designs must consider highly
effective automatic diagnostics as an essential requirement.


TABLE 48.1

Environmental Variables

| Specification Type | Minimum Recommended Range | Recommended Test Specification |
|---|---|---|
| Operating temperature | -10°C to 60°C | IEC 60068-2-2 test Bb |
| Operating temperature change | 0.5°C/min | IEC 60068-2-14 test Nb |
| Storage temperature | -40°C to 85°C | IEC 60068-2-1 test Ab, Ad |
| Storage temperature change | 10°C/min | IEC 60068-2-14 test Na |
| Operating humidity | 5%-95%, non-condensing | IEC 60068-2-3, Ca |
| Storage humidity | 0%-100%, condensing | IEC 60068-2-30, Dd |
| Vibration | 10-150 Hz, 2 g | IEC 60068-2-6, Fc |
| Mechanical shock | 15 g for 11 ms | IEC 60068-2-27, Ea |
| Corrosive resistance | G3 | ANSI/ISA S71.04 |
| Electrostatic discharge immunity | 6 kV contact | IEC 61000-4-2 |
| Electrostatic discharge immunity | 8 kV air | IEC 61000-4-2 |
| Radiated E-field immunity | 80-1000 MHz, 20 V/m | IEC 61000-4-3 |
| Radiated E-field immunity | 1.4-2 GHz, 6 V/m | IEC 61000-4-3 |
| Radiated E-field immunity | 2-2.7 GHz, 3 V/m | IEC 61000-4-3 |
| Magnetic field | 30 A/m | IEC 61000-4-8 |
| Signal line burst immunity (EFT) | 2 kV | IEC 61000-4-4 |
| Signal line surge immunity (EFT) | 2 kV | IEC 61000-4-5 |
| Signal line conducted RF immunity | 150 kHz-80 MHz, 10 V | IEC 61000-4-6 |
| Signal line conducted RF common mode immunity | 15-150 Hz, 10 V | IEC 61000-4-16 |
| Signal line conducted RF common mode immunity | 150 Hz-1.5 kHz, 1 V | IEC 61000-4-16 |
| Signal line conducted RF common mode immunity | 15-150 kHz, 10 V | IEC 61000-4-16 |
| AC/DC power line burst immunity | 4 kV | IEC 61000-4-4 |
| AC/DC power conducted RF immunity | 150 kHz-80 MHz, 10 V | IEC 61000-4-6 |
| AC power line surge immunity | 2 kV line to line | IEC 61000-4-5 |
| AC power line surge immunity | 4 kV line to ground | IEC 61000-4-5 |
| AC power conducted common mode RF immunity | 15-150 Hz, 10 V | IEC 61000-4-16 |
| AC power conducted common mode RF immunity | 150 Hz-1.5 kHz, 1 V | IEC 61000-4-16 |
| AC power conducted common mode RF immunity | 1.5-150 kHz, 10 V | IEC 61000-4-16 |
| AC power voltage dip immunity | 0.5 period, 30% reduction | IEC 61000-4-11 |
| AC power voltage interruption | 250 periods, 95% reduction | IEC 61000-4-11 |
| DC power line surge immunity | 1 kV line to line | IEC 61000-4-5 |
| DC power line surge immunity | 2 kV line to ground | IEC 61000-4-5 |
| DC power voltage dip immunity | 10 ms, 60% reduction | IEC 61000-4-29 |
| DC power voltage interruption | 30 ms, 100% reduction | IEC 61000-4-29 |
| Functional earth burst immunity | 2 kV | IEC 61000-4-4 |
| Functional earth conducted common mode RF immunity | 15-150 Hz, 10 V | IEC 61000-4-16 |
| Functional earth conducted common mode RF immunity | 150 Hz-1.5 kHz, 1 V | IEC 61000-4-16 |
| Functional earth conducted common mode RF immunity | 1.5-150 kHz, 10 V | IEC 61000-4-16 |

Source: Goble, W., Control System Safety Evaluation and Reliability, 3rd edn., ISA, Research Triangle Park, NC, 2010. © ISA.
With permission.

Common Cause 

Another very unrealistic assumption in the ideal redundancy
model of Figure 48.6 is the independence of failures. A
common-cause failure is defined as the failure of more than one
device due to the same stress (cause). In reality, failures are
not always independent. It is possible that multiple devices in
a redundant system will fail due to a common stress. Imagine
that the power supplies of Figure 48.5 are mounted in a single
cabinet and a fault elsewhere in the cabinet generates heat.
It is easy to imagine that the fuses or output transistors in
both power supplies could fail within a relatively short period
of time. This is a common-cause failure. What would happen
if a high-voltage power surge entered both power supplies
of Figure 48.5? If that voltage exceeded the strength of both
devices, both would fail.

FIG. 48.5

Redundant DC power supplies.


FIG. 48.6

Markov model of ideal dual system.


FIG. 48.7

Redundant system with switching mechanism.


TABLE 48.2

System Output Control Logic

| Diagnostic A | Diagnostic B | System Output |
|---|---|---|
| 0 (A failed) | 0 (B failed) | De-energize |
| 0 (A failed) | 1 (B OK) | B |
| 1 (A OK) | 0 (B failed) | A |
| 1 (A OK) | 1 (B OK) | A or B |


A common-cause failure substantially increases system
unavailability in a redundant system [7,8]. This is not a
theoretical problem. An actual study of failures by NASA on the
U.S. Space Shuttle program showed that 11% of all failures
resulted in failure of a redundant system [9]. Figure 48.8
shows how a common-cause failure rate would be added to
the ideal dual system of Figure 48.6. The symbol beta (β) is used
to show the percentage of the device failure rate that will
result in common-cause failures (11% in the NASA example).
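A rough feel for the effect of beta can be obtained without solving the Markov model. The sketch below uses a first-order beta-factor approximation: the common-cause share of the device unavailability is treated as a single non-redundant element, and the remaining independent share is treated as an ideal dual system. This approximation, the function name, and the reuse of the 0.01 device unavailability are assumptions of the example. It is not the Markov solution of Figure 48.8.

```python
def dual_unavailability_with_common_cause(device_unavailability: float, beta: float) -> float:
    """First-order estimate of dual-system unavailability with common cause.

    independent part: both devices fail on their own, ((1 - beta) * u) ** 2
    common-cause part: one stress fails both devices, beta * u
    """
    independent = ((1.0 - beta) * device_unavailability) ** 2
    common_cause = beta * device_unavailability
    return independent + common_cause


if __name__ == "__main__":
    u_device = 0.01
    for beta in (0.0, 0.02, 0.11):
        u_sys = dual_unavailability_with_common_cause(u_device, beta)
        print(f"beta = {beta:4.2f}: system unavailability is about {u_sys:.2e}")
```

Under these assumptions, even a small beta lets the common-cause term dominate the result, so the ideal figure of 0.0001 is not approached.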

Redundancy — Dual Architectures 

FIG. 48.8

Redundant system Markov model showing common-cause failures.


Some redundant designs can provide the intended function
when one failure mode of a device occurs. Consider the dual
power supply of Figure 48.5. Table 48.3 shows a simplified
failure mode analysis of the devices and the impact on
system performance for the dual power supply system.

TABLE 48.3

Device Failure Mode Analysis

| Device Failure Mode | System Impact |
|---|---|
| Power supply fails: Low voltage/open circuit | None |
| Power supply fails: High voltage | Fail: High voltage |
| Diode fails: Open circuit | None |
| Diode fails: Short circuit | Power supply with shorted diode will supply most current; effectively the system has degraded to non-redundant |

It can be seen from the device failure mode analysis that
the design is effective against some device failures but not
all. If one power supply fails with a high-output voltage, the
system fails with a high-output voltage.

Figure 48.9 shows a redundant system with two
electronic controller devices. The safety function of this system
is to de-energize the output load. The output switches that
control the load are wired in series. One controller can fail
with its output switch failed short circuit and the system can
still de-energize the load as the other switch can still open
the circuit.

FIG. 48.9

Dual controller architecture: 1oo2.

This architecture is called “1oo2,” which stands for one
out of two. Two controllers are available to perform the
function but only one is needed. The naming assumes a
de-energize output safety function. Table 48.4 shows a device
failure analysis for the system. Note that the system will fail
even for a single device failure in the open circuit mode.

TABLE 48.4

Device Failure Mode Analysis: 1oo2

| Device Failure Mode | System Impact |
|---|---|
| Controller A fails: Output open | System fails de-energized |
| Controller A fails: Output short | None |
| Controller B fails: Output open | System fails de-energized |
| Controller B fails: Output short | None |

If the function of the system is to energize the
output, a different wiring can be used to provide redundancy.
Figure 48.10 shows a dual architecture called “2oo2.” The
naming convention refers to a dual de-energize to trip
system where two units are present and two units are needed
to de-energize the output. Table 48.5 shows a device failure
mode analysis for this system. The system will tolerate a
single device failure in the open circuit mode, showing opposite
behavior when compared to the 1oo2.
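The two dual architectures can be compared with a small circuit model. The sketch below represents each controller output as a switch that either follows its command or is stuck open or stuck short, with series wiring for 1oo2 as described above. Parallel wiring for 2oo2 is an assumption here, chosen because it reproduces the behavior listed in Table 48.5. The names are hypothetical, and both controllers are assumed to issue the same command.

```python
from enum import Enum


class Mode(Enum):
    OK = "ok"        # switch follows the controller command
    OPEN = "open"    # output failed open circuit
    SHORT = "short"  # output failed short circuit


def switch_closed(commanded_closed: bool, mode: Mode) -> bool:
    """Actual switch state given the command and the failure mode."""
    if mode is Mode.OPEN:
        return False
    if mode is Mode.SHORT:
        return True
    return commanded_closed


def load_energized_1oo2(commanded_closed: bool, mode_a: Mode, mode_b: Mode) -> bool:
    """Series wiring: the load is energized only if both switches are closed."""
    return switch_closed(commanded_closed, mode_a) and switch_closed(commanded_closed, mode_b)


def load_energized_2oo2(commanded_closed: bool, mode_a: Mode, mode_b: Mode) -> bool:
    """Parallel wiring (assumed): the load is energized if either switch is closed."""
    return switch_closed(commanded_closed, mode_a) or switch_closed(commanded_closed, mode_b)


if __name__ == "__main__":
    for mode in (Mode.OPEN, Mode.SHORT):
        for name, circuit in (("1oo2", load_energized_1oo2), ("2oo2", load_energized_2oo2)):
            normal = circuit(True, mode, Mode.OK)     # controllers command "energize"
            tripped = circuit(False, mode, Mode.OK)   # controllers command "de-energize"
            print(f"{name}, A failed {mode.value}: normal={normal}, on trip={tripped}")
```

A result of normal=False is a spurious de-energization, and a result of on trip=True is a failure to trip. The output reproduces the system impacts listed in Tables 48.4 and 48.5.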

Redundancy — Triple Architectures 



FIG. 48.10

Dual controller architecture: 2oo2.


TABLE 48.5

Device Failure Mode Analysis: 2oo2

| Device Failure Mode | System Impact |
|---|---|
| Controller A fails: Output open | None |
| Controller A fails: Output short | System fails energized |
| Controller B fails: Output open | None |
| Controller B fails: Output short | System fails energized |


Redundant equipment can also be designed to protect against
both a single de-energize failure mode and a single energize
failure mode. This can be done using three devices, each with
two output switches. Figure 48.11 shows a triple controller
architecture called “2oo3.”

FIG. 48.11

Triple controller architecture: 2oo3.

The six outputs are wired into a “majority voting” circuit.
This circuit can tolerate one failure in either failure mode.
For example, suppose controller A fails with its output short
circuited. This condition is illustrated in Figure 48.12. By
careful observation of the electrical current paths, one can see
that switches B1 and C1 can still de-energize the circuit.
Effectively, the circuit has degraded into a 2oo2 using switches
B1 and C1.

FIG. 48.12

Majority voting circuit with controller A failed short circuit.

Figure 48.13 shows the situation where controller A has
failed with output open circuit. It can be seen that the system
continues to operate and that switches B2 and C2 still control
the load. The failure has caused the system to degrade into a
1oo2 equivalent.

FIG. 48.13

Majority voting circuit with controller A failed open circuit.
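The degradation behavior of the majority voting circuit can be checked with a small model. The wiring used below (three parallel branches, each with two switches in series: A1 with B1, A2 with C1, and B2 with C2) is an assumption inferred from the two degraded cases described in the text. The actual circuit of Figure 48.11 may pair the switches differently. Both switches of a controller are assumed to share the controller's state.

```python
OK, OPEN, SHORT = "ok", "open", "short"


def closed(commanded_closed: bool, mode: str) -> bool:
    """Actual state of a controller's output switches given its failure mode."""
    if mode == OPEN:
        return False
    if mode == SHORT:
        return True
    return commanded_closed


def load_energized_2oo3(commands, modes) -> bool:
    """Majority voting circuit for controllers (A, B, C).

    commands: three booleans, True when the controller commands "energize".
    modes: three failure modes, each one of OK, OPEN, SHORT.
    """
    a, b, c = (closed(cmd, mode) for cmd, mode in zip(commands, modes))
    a1 = a2 = a
    b1 = b2 = b
    c1 = c2 = c
    # Three parallel branches, each a series pair (assumed pairing).
    return (a1 and b1) or (a2 and c1) or (b2 and c2)


if __name__ == "__main__":
    healthy = (OK, OK, OK)
    print("healthy, only A trips:     ", load_energized_2oo3((False, True, True), healthy))
    print("healthy, A and B trip:     ", load_energized_2oo3((False, False, True), healthy))
    print("A short, only B trips:     ", load_energized_2oo3((False, False, True), (SHORT, OK, OK)))
    print("A short, B and C trip:     ", load_energized_2oo3((False, False, False), (SHORT, OK, OK)))
    print("A open, only B trips:      ", load_energized_2oo3((True, False, True), (OPEN, OK, OK)))
```

With controller A shorted, both B and C must open to de-energize the load (2oo2 behavior). With controller A open, either B or C alone de-energizes it (1oo2 behavior).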

Advanced Redundant Architectures 

In devices with microcomputers, automatic self-diagnostics 
can be designed into the device. When a redundant system is 
designed with such devices, new architectures are possible. 
Figure 48.14 shows an architecture known as 2oo2D. The let- 
ter D was added to indicate a diagnostic switch added to the 
output to control the behavior of the architecture [10]. 

The objective of the architecture was to turn short circuit
failures of the controller and its output into open circuit
failures if they are detected by automatic diagnostics, such that
the diagnostic switch opens. The redundancy was added with
switches wired in parallel to avoid an open circuit failure of a
controller. Detailed analysis has shown that the architecture
works well in a de-energize to trip application if the automatic
diagnostics are good enough. The architecture depends on having
very good diagnostics, with diagnostic coverage in the 95% plus
range.

Another version of this architecture is called the 1oo2D.
This is shown in Figure 48.15. Extra diagnostic control
circuits were added such that each diagnostic circuit could open
not only its own diagnostic output switch but also its partner's
diagnostic switch. This, combined with output readback
diagnostics and comparison diagnostics, provided a bias toward
the de-energize to trip mode at the expense of operation in
energize to trip applications.

Advanced Hybrid Architectures 

In the twenty-first century, designs have moved toward hybrid
combinations of these classic fault-tolerant architectures.
The design choice is often based on the application, the ability
to implement effective automatic diagnostics, and market
demand. Table 48.6 shows a listing of various architectures
and their fault-tolerance capabilities.


FIG. 48.14

2oo2D voting architecture.


FIG. 48.15

1oo2D voting architecture.


RELIABILITY IMPROVEMENT— VOTING SYSTEM 

Redundancy is the term used for multiple devices perform- 
ing the same function. Voting is the term used to describe 
how the information from redundant devices is applied. One 
can have redundant devices without voting, or with manual 
voting if an operator can check the output of the redundant 
devices and decide which output is to perform the required 
function. For example, if a single sensor/logic/effecter chain 
has inadequate reliability, redundant systems can be brought 
into action to reduce the probability of failure on demand 
or to reduce the probability of false tripping or to yield both 
functions [11]. 

There are several approaches available for redundancy 
and voting, and the correct choice of method depends on the 
relative sensitivity of the systems to the two failure modes, 
in economic and human-factor risks, and the cost and the 
practicality of proof testing. 

A simple example of redundancy and voting is the tra- 
ditional temperature measurement installation, where a 
bimetallic thermometer is installed in one thermowell, and 
a duplex thermocouple element is installed in a second well, 
with one element connected to an alarm/shutdown system, 
and the other to the process control system. Here, the ther- 
mometer can provide local operator indication, and the two 
thermocouple elements offer redundant information to the 
control room operator. This system also demonstrates one of
the weaknesses of redundant systems: common mode failure.
Both of the duplex elements are identical and are closely
coupled mechanically and electrically. Drift in calibration
(such as from manganese migration between sheaths and elements),
phase change hysteresis, or straight mechanical damage
can produce similar effects in both components. A slip
during maintenance can easily take out both measurements
at once, and it is impossible to repair one element without
taking the other out of service.

TABLE 48.6
Architecture Comparison

| Architecture | Number of Units | Output Switches | Safety Fault Tolerance | Availability Fault Tolerance | Objective |
|---|---|---|---|---|---|
| 1oo2 | 2 | 2 | 1 | 0 | High safety |
| 2oo2 | 2 | 2 | 0 | 1 | Maintain output |
| 2oo3 | 3 | 6 (4*) | 1 | 1 | Safety and availability |
| 2oo2D | 2 | 4 | 0 (failure not detected); 1 (failure detected) | 1 | Safety and availability, bias toward availability |
| 1oo2D | 2 | 4 | 1 | 0 (failure not detected); 1 (failure detected) | Safety and availability, bias toward safety |

* Some commercial implementations use four switches.
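The fault tolerance figures in Table 48.6 for the architectures without diagnostics can be checked with a small voting model. The following is a minimal sketch, assuming that "MooN" means at least M of N units must request a trip before the output trips, that a dangerously failed unit is stuck at "no trip," and that a safely failed unit is stuck at "trip." The function names are hypothetical, and the diagnostic (D) architectures are not modeled because their behavior depends on whether the failure is detected.

```python
from typing import List


def votes_to_trip(m: int, votes: List[bool]) -> bool:
    """MooN vote: trip when at least m units request a trip."""
    return sum(votes) >= m


def safety_fault_tolerance(m: int, n: int) -> int:
    """Number of units that can fail dangerously (stuck at "no trip")
    while a real demand still produces a trip."""
    failed = 0
    # On a demand, healthy units vote True and failed units vote False.
    while failed < n and votes_to_trip(
        m, [False] * (failed + 1) + [True] * (n - failed - 1)
    ):
        failed += 1
    return failed


def availability_fault_tolerance(m: int, n: int) -> int:
    """Number of units that can fail safely (stuck at "trip")
    without causing a false trip when there is no demand."""
    failed = 0
    # With no demand, healthy units vote False and failed units vote True.
    while failed < n and not votes_to_trip(
        m, [True] * (failed + 1) + [False] * (n - failed - 1)
    ):
        failed += 1
    return failed


for name, m, n in [("1oo2", 1, 2), ("2oo2", 2, 2), ("2oo3", 2, 3)]:
    print(name, safety_fault_tolerance(m, n), availability_fault_tolerance(m, n))
```

Under these assumptions the loop prints 1 and 0 for 1oo2, 0 and 1 for 2oo2, and 1 and 1 for 2oo3, which agrees with the corresponding rows of the table.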

Voting as a Control Strategy 

Voting of analog information can be useful in control as well 
as in safety systems. Many modern control systems carry a 
quality bit from each input through their calculations, and
this bit can be used to switch calculations when an input fails.

One of the more useful tools is median signal select.
The median of an odd number of signals is the value
that has an equal number of signals larger and smaller than
itself. Thus, if we consider three transmitters measuring
a common variable, with measurements of 25%, 26%, and
30%, the median signal is 26%. The median is less sensitive
to gross errors than the mean or average, and will allow a 
system to continue operating in reasonable control if any one 
device fails either high or low. The median of three signals 
can be computed by selecting the highest of the outputs from 
three low signal selects. Thus, for signals A, B, and C,

$$\mathrm{MEDIAN}(A, B, C) = \mathrm{MAX}(X, Y, Z)$$

where

$$X = \mathrm{MIN}(A, B), \quad Y = \mathrm{MIN}(B, C), \quad Z = \mathrm{MIN}(C, A)$$
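The construction from three low signal selects and one high select can be written out directly. This is a minimal sketch with hypothetical function names. It also includes the equivalent form built from three high selects and one low select, and it checks the 25%, 26%, 30% example along with one transmitter failing high and one failing low.

```python
def median_of_three(a: float, b: float, c: float) -> float:
    """Median as the highest of three pairwise low selects."""
    x = min(a, b)
    y = min(b, c)
    z = min(c, a)
    return max(x, y, z)


def median_of_three_high_selects(a: float, b: float, c: float) -> float:
    """Equivalent form: the lowest of three pairwise high selects."""
    return min(max(a, b), max(b, c), max(c, a))


# Normal operation: the middle reading is selected.
print(median_of_three(25.0, 26.0, 30.0))

# One transmitter fails high (100%) or low (0%): the selected value
# stays with one of the two remaining transmitters.
print(median_of_three(25.0, 26.0, 100.0))
print(median_of_three(25.0, 0.0, 30.0))
```

The three calls select 26.0, 26.0, and 25.0. Both functions return the same value for any three inputs, so the choice between them is mainly a matter of which selector blocks are available in the control system.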

The same result can be obtained from the lowest of three 
high selects. See Figure 48.16. For an even number of inputs
(four or more), the median is taken as the average of the
two most central values. Care must be taken, however, when using
median signal selection with a derivative action controller,
because switching between transmitters causes a step change
in the rate of change of the measurement.
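Both points in this paragraph can be illustrated numerically. The sketch below, with a hypothetical `median_select` helper and made-up transmitter values, averages the two central values for an even number of inputs. It then shows that when a ramping transmitter crosses a steady one, the selected value stays continuous but its slope changes abruptly, which is what a derivative term responds to.

```python
from typing import List, Sequence


def median_select(signals: Sequence[float]) -> float:
    """Median of any number of signals; an even count averages the two central values."""
    if not signals:
        raise ValueError("at least one signal is required")
    ordered = sorted(signals)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


# Even number of inputs: the average of the two central values.
print(median_select([25.0, 26.0, 30.0, 31.0]))

# Illustrative values only: one transmitter steady at 50%, one ramping
# from 48% at 1%/s, and one steady at 40%.
times: List[float] = [0.0, 1.0, 2.0, 3.0, 4.0]
selected = [median_select([50.0, 48.0 + t, 40.0]) for t in times]
slopes = [
    (selected[i + 1] - selected[i]) / (times[i + 1] - times[i])
    for i in range(len(times) - 1)
]
print(selected)
print(slopes)
```

In this example the selected value rises from 48 to 50 and then holds, while the slope drops from 1 to 0 at the moment the selection switches from the ramping transmitter to the steady one.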

With three identically ranged transmitters, one can 
extract redundant data to determine high alarm, low alarm, 
and control measurement. The median signal becomes the 
control measurement, the highest good-quality signal gives 
the high alarm, and the lowest good-quality signal gives the 
low alarm. “Good quality” is normally taken as between 0%
and 100%, but this range may be narrowed considerably by using
other information, such as requiring a signal to lie within
3 standard deviations of the historical mean of all signals, or
by using the on-board diagnostic messages from a “smart” transmitter.

FIG. 48.16
Median signal select.
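The extraction of control, high alarm, and low alarm values from three transmitters can be sketched as follows. This is an illustration under stated assumptions, not a vendor implementation: the names `Selection` and `select_measurements` are hypothetical, and "good quality" is reduced to the simple 0% to 100% range check, with the limits exposed as parameters so that a narrower band could be substituted.

```python
from typing import NamedTuple, Optional, Sequence


class Selection(NamedTuple):
    control: float               # median of the three signals
    high_alarm: Optional[float]  # highest good-quality signal, if any
    low_alarm: Optional[float]   # lowest good-quality signal, if any


def select_measurements(
    signals: Sequence[float],
    good_low: float = 0.0,     # assumed lower limit of "good quality", in percent
    good_high: float = 100.0,  # assumed upper limit of "good quality", in percent
) -> Selection:
    """Derive control and alarm values from three identically ranged transmitters."""
    if len(signals) != 3:
        raise ValueError("exactly three signals are expected")
    control = sorted(signals)[1]
    good = [s for s in signals if good_low <= s <= good_high]
    high_alarm = max(good) if good else None
    low_alarm = min(good) if good else None
    return Selection(control, high_alarm, low_alarm)


# Third transmitter has failed high, outside the 0% to 100% band.
result = select_measurements([25.0, 26.0, 104.0])
print(result)
```

With one transmitter reading 104%, the control measurement is 26%, the high alarm value is 26%, and the low alarm value is 25%, so the failed signal affects neither the control loop nor the alarms.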

References 

1. Billinton, R. and Allan, R.N., Reliability Evaluation of 
Engineering Systems: Concepts and Techniques, New York: 
Plenum Press, 1983. 

2. Dhillon, B.S., Reliability Engineering in Systems Design and
Operation, New York: Van Nostrand Reinhold, 1983.




3. Goble, W. and van Beurden, I., An Introduction to Reliability 
and Safety Engineering, Sellersville, PA: Exida, 2010. 

4. Brombacher, A.C., Reliability by Design — CAE Techniques for 
Electronic Components and Systems, New York: John Wiley & 
Sons, 1992. 

5. IEC 61508, Functional Safety of Electrical/Electronic/
Programmable Electronic Safety-Related Systems, 1st
edn., Geneva, Switzerland: International Electrotechnical
Commission, 2000.

6. Goble, W., Control System Safety Evaluation and Reliability,
3rd edn., Research Triangle Park, NC: ISA, 2010.

7. Gray, J., A census of tandem availability between 1985 and 
1990, IEEE Transactions on Reliability, 39(4), October 1990. 


8. Siewiorek, D.P., Fault tolerance in commercial computers,
Computer Magazine, IEEE Computer Society, July 3(7),
26-37, 1990.

9. Rutledge, P.J. and Mosleh, A., Dependent failures in space- 
craft: Root causes, coupling factors, defenses, and design 
implications, 1995 Proceedings of the Annual Reliability and 
Maintainability Symposium, New York, IEEE, 1995. 

10. Goble, W.M., Evaluating Control Systems Reliability:
Techniques and Applications, Research Triangle Park, NC:
ISA, 1992.

11. Gibson, I.H., Redundant or voting systems for increased
reliability, Process Software and Digital Networks, Vol. 3, 3rd
edn., Boca Raton, FL: CRC Press, 2002.





